STATS 763: Assignment 1: Cycling in the Rain

Tarin Eccleston

07/08/2024

Data Manipulation

Based on the brief, we would like these columns in our dataframe for modelling:

  • date (used for reference)
  • cycle_count
  • weighted_average_rainfall
  • cycle_recording_sites
  • covid_status
  • year
  • month
  • day_of_week

Read Data

auckland_cycle_2018 <- read.csv("data/aucklandcyclists2018.csv")
auckland_cycle_2019 <- read.csv("data/aucklandcyclists2019.csv")
auckland_cycle_2020 <- read.csv("data/aucklandcyclists2020.csv")
auckland_cycle_2021 <- read.csv("data/aucklandcyclists2021.csv")
auckland_cycle_2022 <- read.csv("data/aucklandcyclists2022.csv")

monitoring_locations <- read.csv("data/monitoringlocations.csv")

daily_rainfall <- read.csv("data/daily-rainfall-at-30-sites-1960-to-2022.csv")
daily_rainfall_data_dict <- read.csv("data/daily-rainfall-at-30-sites-1960-to-2022-dictionary.csv")

Standardise Date Format

We need all date formats to be in YYYY-MM-DD, and as date objects.

auckland_cycle_2018 <- auckland_cycle_2018 %>%
  mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
  dplyr::select(-Date)

auckland_cycle_2019 <- auckland_cycle_2019 %>%
  mutate(date = dmy(str_remove(Date, "^[A-Za-z]+,\\s*"))) %>%
  mutate(date = as.Date(date, format = "%d-%m-%Y")) %>%
  dplyr::select(-Date)

auckland_cycle_2020 <- auckland_cycle_2020 %>%
  mutate(date = dmy(str_remove(Date, "^[A-Za-z]+\\s+"))) %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  dplyr::select(-Date)

auckland_cycle_2021 <- auckland_cycle_2021 %>%
  mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
  dplyr::select(-Date)

auckland_cycle_2022 <- auckland_cycle_2022 %>%
  mutate(date = as.Date(Date, format = "%Y-%m-%d")) %>%
  dplyr::select(-Date)

Pivot Data and Merge

auckland_cycle_2018 <- auckland_cycle_2018 %>%
  pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
  drop_na(cycle_count) %>%
  mutate(cycle_count = round(cycle_count))

auckland_cycle_2019 <- auckland_cycle_2019 %>%
  pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
  drop_na(cycle_count)

auckland_cycle_2020 <- auckland_cycle_2020 %>%
  pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
  drop_na(cycle_count)

auckland_cycle_2021 <- auckland_cycle_2021 %>%
  pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
  drop_na(cycle_count)

auckland_cycle_2022 <- auckland_cycle_2022 %>%
  pivot_longer(cols = -date, names_to = "cycle_recording_site", values_to = "cycle_count") %>%
  drop_na(cycle_count)

auckland_cycle <- rbind(auckland_cycle_2018, auckland_cycle_2019, auckland_cycle_2020, auckland_cycle_2021, auckland_cycle_2022)

auckland_cycle <- auckland_cycle %>%
  mutate(cycle_count = round(cycle_count)) %>%
  mutate(cycle_recording_site = as.factor(cycle_recording_site))

We drop all NAs. It would be difficult to impute, and we already have quite a bit of data. Though the missing data could have an effect on biasing our estimates.

I’m unsure why we have non-integer counts for cycle counts so I rounded all cycle counds.

Filter for Important Rainfall Data

Two of the rainfall monitoring locations are relevant to Auckland: “Auckland (Auckland)”, close to the airport, and “Whangaparāoa (Auckland)”.

We only need need rainfall data corresponding to cycling data between 2018-01-01 and 2022-12-31.

daily_rainfall_auckland <- daily_rainfall %>%
  filter(site == "Auckland (Auckland)") %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  filter(date >= as.Date("2018-01-01"))

daily_rainfall_whangaparaoa <- daily_rainfall %>%
  filter(site == "Whangaparāoa (Auckland)") %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
  filter(date >= as.Date("2018-01-01"))

Calculate Distances between Rainfall Stations and Cycle Recording Sites

auckland_rainfall_station_coordinates <- data_frame(
  lat = -37.00813,
  lon = 174.7887
)

whangaparaoa_rainfall_station_coordinates <- data_frame(
  lat = -36.606,
  lon = 174.835
)

rainfall_stations_to_cycle_recording_sites <- data_frame(cycle_recording_site = monitoring_locations$Location)

rainfall_stations_to_cycle_recording_sites <- rainfall_stations_to_cycle_recording_sites %>%
  mutate(distance_to_auckland = pmap_dbl(
    list(lat = monitoring_locations$Latitude, lon = monitoring_locations$Longitude),
    ~ distHaversine(c(..2, ..1), c(auckland_rainfall_station_coordinates$lon, auckland_rainfall_station_coordinates$lat))
  )) %>%
  mutate(distance_to_whangaparaoa = pmap_dbl(
    list(lat = monitoring_locations$Latitude, lon = monitoring_locations$Longitude),
    ~ distHaversine(c(..2, ..1), c(whangaparaoa_rainfall_station_coordinates$lon, whangaparaoa_rainfall_station_coordinates$lat))
  ))

Calculate Weighted Average Rainfall by Relative Distances between Rainfall Stations and Cycle Recording Sites

auckland_cycle <- auckland_cycle %>%
  left_join(rainfall_stations_to_cycle_recording_sites, by = "cycle_recording_site") %>%
  left_join(daily_rainfall_auckland %>% dplyr::select(date, rainfall) %>% dplyr::rename(rainfall_auckland = rainfall), by = "date") %>%
  left_join(daily_rainfall_whangaparaoa %>% dplyr::select(date, rainfall) %>% dplyr::rename(rainfall_whangaparaoa = rainfall), by = "date") %>%
  mutate(weighted_average_rainfall = (rainfall_auckland / distance_to_auckland + rainfall_whangaparaoa / distance_to_whangaparaoa) /
                        (1 / distance_to_auckland + 1 / distance_to_whangaparaoa)) %>%
  dplyr::select(-c(distance_to_auckland, distance_to_whangaparaoa, rainfall_auckland, rainfall_whangaparaoa))

Another approach would be to find the closest rainfall station and use those measurements for that day at that cycle recording site. However, I believe using the weighted average between the two rainfall stations in Auckland weighted by their relative distances to the cycle recording sites is the best approach as it balances out bias created from having rainfall measurements taken at a vastly different location to the cycle recording sites.

Find Day of the Week, Month, and Year

auckland_cycle <- auckland_cycle %>%
  mutate(day_of_week = as.factor(weekdays(date))) %>%
  mutate(day_of_week = factor(day_of_week, 
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
  mutate(month = factor(month(date), labels = month.name)) %>%
  mutate(year = as.factor(year(date)))

These covariates should be sufficient to account for seasonality (in addition to covid_status).

Set COVID Status

auckland_cycle <- auckland_cycle %>%
  mutate(covid_status = case_when(
    date >= as.Date("2018-01-01") & date <= as.Date("2020-03-24") ~ "pre-covid",
    date >= as.Date("2020-03-25") & date <= as.Date("2020-04-27") ~ "level-4-lockdown",
    date >= as.Date("2020-04-28") & date <= as.Date("2020-05-10") ~ "level-3-lockdown",
    date >= as.Date("2020-05-11") & date <= as.Date("2020-05-24") ~ "level-2-lockdown",
    date >= as.Date("2020-05-25") & date <= as.Date("2020-06-07") ~ "level-1-lockdown",
    date >= as.Date("2020-06-08") & date <= as.Date("2020-08-10") ~ "inter-lockdown",
    date >= as.Date("2020-08-11") & date <= as.Date("2020-08-29") ~ "level-3-lockdown",
    date >= as.Date("2020-08-30") & date <= as.Date("2020-09-22") ~ "level-2.5-lockdown",
    date >= as.Date("2020-09-23") & date <= as.Date("2020-10-06") ~ "level-2-lockdown",
    date >= as.Date("2020-10-07") & date <= as.Date("2021-02-13") ~ "inter-lockdown",
    date >= as.Date("2021-02-14") & date <= as.Date("2021-03-06") ~ "level-3-lockdown",
    date >= as.Date("2021-03-07") & date <= as.Date("2021-08-16") ~ "inter-lockdown",
    date >= as.Date("2021-08-17") & date <= as.Date("2021-09-20") ~ "level-4-lockdown",
    date >= as.Date("2021-09-21") & date <= as.Date("2021-12-01") ~ "level-3-lockdown",
    date >= as.Date("2021-12-02") & date <= as.Date("2021-12-29") ~ "traffic-red-lockdown",
    date >= as.Date("2021-12-30") & date <= as.Date("2022-09-11") ~ "traffic-orange-lockdown",
    date >= as.Date("2022-09-12") & date <= as.Date("2022-12-31") ~ "post-covid",
  )) %>%
  mutate(covid_status = as.factor(covid_status))

I obtained lockdown and restriction dates from policycommons.ac.nz.

Exploratory Analysis

Causal DAG

I believe there aren’t any relevant colliders or mediators for rainfall on cycle count.

Bivariate Plots

These plots show direct relationships between the concerning covariate with direct effects on cycle_count and confounders of rainfall, supporting our causal diagram.

There was noticeably more cycle counts in 2019. There appears to be an overall downward trend on cycling counts across the years. People often cycle for leisure or transport (usually to and from work). Since COVID-19, there has been a rise of hybrid/remote work, which in combination with lockdowns, could explain the downward trend in cycle counts. We include year in our model. We include year in our model as this should have direct effects on cycle counts and a confounder of rainfall on cycle counts.

There appears to be seasonality - the colder months (June, July, and August) appear to have less cycle total counts compared to warmer months (December, January, February, March). There are unusually low total cycle counts in December, this could be due to the Christmas Holidays. We include month in our model as this should have direct effects on cycle counts and a confounder of rainfall on cycle counts.

It would be interesting to see the trend and seasonality across all dates. Although dates will not be used in our model.

The plot of total cycle counts vs date supports the seasonality seen between month and total cycle count, and downward trend seen between year and total cycle count.

However these trends and seasonality are disrupted during the COVID-19 lockdowns, hence why we need to adjust for the COVID status covariate when fitting our model.

There appears to be seasonality - the middle of the weekday (Tuesday and Wednesday) has the highest total cycle counts, whereas the weekend (Saturday and Sunday) have the lowest total cycle counts. This supports the assumption that people often cycle for leisure or transport (usually to and from work), the middle of the weekday being when people are most likely working. We include day of week in our model as it has direct effects on cycle count.

I averaged the cycle counts across all dates and locations within a given COVID status, this is because each status has different periods. Note that averages can be biased due to other seasonal and trend effects.

There appears to be a higher average cycle count before the COVID-19 pandemic, which decreases significantly starting from the first lockdown in March 2020. The “post-COVID” period also shows a noticeably lower average cycle count. The primary focus of this analysis is to model the relationship between rainfall and cycle counts. Although predicting and explaining cyclist behavior during the COVID-19 period is challenging, we adjust for COVID status in our model to reduce noise as it directly impacts cycle counts. However, our primary attention is not on the inference of the COVID status covariate itself but rather on understanding the relationship between rainfall and cycle counts.

There is non-uniformity between total cycle count vs cycle recording site. Certain cycle recording sites such as NW Cycleway Kingslandm, Quay St. Eco Display Classic, and Quay St. Spark Arena have more cycle counts across the dataset compared to other cycle recording sites such as Matakana. There seems be an effect of rural and urbanised recording sites on total cycle count, with urbanised sites being more popular. We include cycle recording site in our model as it has direct effects on cycle count.

There appears to be some seasonality of year on rainfall, this could be due to El Niño-Southern Oscillation (ENSO). We include year in our model as a confounder for rainfall on cycle count, and don’t need to include ENSO index as year covariate should cover the effects of ENSO.

Note: The bands are caused by taking rainfall measurements on the same day but at different sites with similar amounts of weighted average rainfall.

There appears to be seasonality of month on rainfall - winter months (June, July, and August) having the highest rainfall compared to other months. January, a Summer month, has the lowest rainfall compared to other months. We include month in as a confounder for rainfall on cycle count.

There is uniformity between rainfall vs cycle recording site. However some sites (GI to Tamaki Dr. Section 2, Matakana, New Lynn to Avondale SUP, and Quay St. Totem UZELT) appear to have much lower rainfalls than all other sites, this is possibly due to the missing values in the data.

The uniformity could be due to not having local rainfall data which accounts for micro-climates (i.e. Matakana and Karangahape Rd. should have different micro-climates). We are weighting the average between measurements from only two rainfall stations based on distance, hence why we would expect there to be less variation between rainfall across all cycle recording sites.

Despite the limitation with our data, we still stand to include cycle recording site in our causal diagram as a confounder for rainfall on cycle count. Cycle recording site is still included in our model, however it’s estimates probably tell us more about the direct effects of cycle recording site on cycle count.

A clear negative trend between rainfall and cycle count can be seen across all sites and at each site. Higher rainfall measurements results in less cycle counts, inversely lower rainfall measurements results in more cycle counts.

There seems to be an inflated amount of 0 mm rainfall recordings.

Modelling

Poisson Model

Since we are dealing with counts, we use a Poisson model as a first step, with first order covariates and no interactions.

cycle_count_poisson_model <- glm(cycle_count ~ weighted_average_rainfall + day_of_week + month + year + cycle_recording_site + covid_status - 1, family = poisson(link = "log"), data = auckland_cycle)
cycle_count_poisson_model_summary
## 
## Call:
## glm(formula = cycle_count ~ weighted_average_rainfall + day_of_week + 
##     month + year + cycle_recording_site + covid_status - 1, family = poisson(link = "log"), 
##     data = auckland_cycle)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -41.262   -4.572   -0.732    3.234  193.406  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## weighted_average_rainfall -2.525e-02  4.573e-05 -552.16   <2e-16 ***
## year2019                   4.189e-02  6.641e-04   63.08   <2e-16 ***
## year2020                   8.752e-02  1.049e-03   83.40   <2e-16 ***
## year2021                   1.051e-01  1.387e-03   75.77   <2e-16 ***
## year2022                   3.476e-01  8.077e-03   43.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 239480046  on 79449  degrees of freedom
## Residual deviance:   4361284  on 79366  degrees of freedom
## AIC: 4907324
## 
## Number of Fisher Scoring iterations: 5

The residual deviance appears a lot larger than the degrees of freedom, this suggests over-dispersion. The next step would be to fit a negative binomial model with a dispersion parameter.

The average effects of weighted average rainfall are: for every one-unit (1 mm) increase of rainfall, the expected count of cyclists decreases by a factor of \(e^{-0.02525} \approx 0.975\), or about \(2.5\%\) less.

The fit line looks quite noisy, however we can see that the line follows the trend quite closely with cycle counts decreasing with increasing rainfall measurement.

The variance of the scatter is not constant, it increases as the fitted values increase, suggesting that there is over-dispersion. A solution is use a negative binomial model.

Negative Binomial

cycle_count_negbin_model <- glm.nb(cycle_count ~ weighted_average_rainfall + day_of_week + month + year + cycle_recording_site + covid_status - 1, data = auckland_cycle)
cycle_count_negbin_model_summary
## 
## Call:
## glm.nb(formula = cycle_count ~ weighted_average_rainfall + day_of_week + 
##     month + year + cycle_recording_site + covid_status - 1, data = auckland_cycle, 
##     init.theta = 4.069766297, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.3553  -0.7794  -0.1173   0.4658  12.3624  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## weighted_average_rainfall -0.0209893  0.0002997 -70.031  < 2e-16 ***
## year2019                   0.0339994  0.0059227   5.740 9.44e-09 ***
## year2020                   0.1127775  0.0100965  11.170  < 2e-16 ***
## year2021                   0.1318295  0.0126806  10.396  < 2e-16 ***
## year2022                   0.2689079  0.0567426   4.739 2.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.0698) family taken to be 1)
## 
##     Null deviance: 72246831  on 79449  degrees of freedom
## Residual deviance:    83839  on 79366  degrees of freedom
## AIC: 926037
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.0698 
##           Std. Err.:  0.0208 
## 
##  2 x log-likelihood:  -925868.5680

The coefficient for weighted_average_rainfall has increased slightly.

The residual deviance appears a lot smaller now. The dispersion parameter has been found to be 4.1.

The average effects of weighted average rainfall are: for every one-unit (1 mm) increase of rainfall, the expected count of cyclists decreases by a factor of \(e^{-0.0209893} \approx 0.979\), or about \(2.2\%\) less.

Fitted lines appear almost identical to the Poisson model, however slightly less noise, and preserving the same trend.

The residuals have less of a trend and appear much smaller for the negative binomial model compared to the Poisson model. This is probably the final model of choice.

Discussion

Interactions

I did not include interactions, and used a first order model with all the covariates of concern. An extension of this investigation would be to explore interaction effects between certain covariates and their effect on cycle count.

Zero-inflated Rainfall Measurements

There appears be zero-inflated rainfall measurements, although it would be tempting to use a zero-inflated Poisson, that is only if the dependent variable is zero-inflated. In this case we could fit a compound Poisson model and account for over-dispersion like in our negative binomial model of choice.

Seasonality

Although I included the ENSO index in the causal diagram and is a confounder for rainfall, I decided not to investigate this further in the model, as I believe the El Nino / El Nina effects are already explained by the year covariate.

Special Days

Public holidays such as Easter, Christmas, and New Years to name a few, are an important covariate which has direct effects on the cycle count. This is a covariate I would model as an extension of this analysis, that may serve to decrease noise, decreasing residual of rainfall on cycle counts.

Other Weather Covariates (Not Included in Analysis)

Rainfall is a good indicator of weather conditions favorable for cycling, however we should also consider temperate and wind speed in our model, that may serve to decrease the residual of rainfall on cycle counts. However I’m not well versed in climate science, and adjusting for these other covariates might introduce mediator in our model which could be a problem.

Weekend Effect

Based on my research, it appears that rainfall effects are more pronounced between weekdays and weekends in very large metropolitan areas. I chose not to include this covariate in the Causal DAG and model, as I suspect the weekend effect may be negligible in a smaller city like Auckland. Additionally, our rainfall data is not the most comprehensive, which might render this consideration less impactful. Even if there were a slight effect, it should be adequately accounted for by the day of week covariate.